Lua strings are not UTF-8, so it is incorrect to represent them as Rust strings. In particular, some Lua libraries specifically designed to add Unicode support, like ustring, use `string.find` to look for non-ASCII bytes in Lua strings, which is impossible when Lua strings are treated as UTF-8 (a character set of `[\x80-\xff]` is ill-formed). This is an API-breaking change. Note that downstream consumers like piccolo will now convert using `IntoValue for Vec<T>` instead of `IntoValue for StdString`, so they need to be changed to call `ctx.intern` explicitly instead of relying on the auto-conversion.
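For illustration, the migration for a piccolo consumer looks roughly like this (a minimal sketch assuming piccolo's `Context::intern`; the helper and its name are hypothetical):

```rust
use piccolo::{Context, Value};

// Hypothetical helper: turn matcher output into a Lua value. Interning the
// raw bytes preserves non-UTF-8 contents such as `[\x80-\xff]` sequences,
// which the old `IntoValue for StdString` route could not represent.
fn bytes_to_lua<'gc>(ctx: Context<'gc>, bytes: Vec<u8>) -> Value<'gc> {
    Value::String(ctx.intern(&bytes))
}
```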
PIL 20.2: > An upper case version of any of those classes represents the complement of the class.
Because this changes the signature of an exported function (removing an unnecessary `Result`) and a struct (eliminating an unnecessary `Box`), this patch contains API-breaking changes.
This makes the replacement method agnostic to whatever type the consumer is using, instead of requiring the table to be collected into a `HashMap`. This is an API-breaking change.
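The shape of the change is roughly the following (a minimal sketch; `replace_words` and its signature are illustrative stand-ins, not lsonar's actual API):

```rust
use std::collections::BTreeMap;

// The replacement source is a callback rather than a concrete `HashMap`,
// so the consumer can back it with any table-like type.
fn replace_words(subject: &str, mut lookup: impl FnMut(&str) -> Option<String>) -> String {
    subject
        .split_whitespace()
        .map(|w| lookup(w).unwrap_or_else(|| w.to_string()))
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    // Any map-like backing works; here a BTreeMap stands in for a Lua table.
    let table = BTreeMap::from([("hello", "world")]);
    let out = replace_words("hello there", |k| table.get(k).map(|v| v.to_string()));
    assert_eq!(out, "world there");
}
```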
Lua 5.4 § 6.4.1: > As a special case, the capture () captures the current string position (a number). For instance, if we apply the pattern "()aa()" on the string "flaaap", there will be two captures: 3 and 5.
Hello! I will definitely review everything in the next two days and provide comments. Thank you so much!
This is necessary for interleaving VM function calls, where a callback cannot complete synchronously, without forcing the normal API to be async.
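The shape this enables is roughly the following (a hypothetical sketch; `GsubMachine`, `Step`, and `step` are illustrative names, not lsonar's API, and a literal needle stands in for a full pattern):

```rust
struct GsubMachine<'a> {
    subject: &'a str,
    needle: &'a str, // literal needle stands in for a full pattern
    pos: usize,
    out: String,
}

enum Step {
    // Caller must supply a replacement for the match, then call `step` again;
    // a VM can run a Lua callback between these two calls.
    NeedsReplacement,
    Done(String),
}

impl<'a> GsubMachine<'a> {
    fn new(subject: &'a str, needle: &'a str) -> Self {
        Self { subject, needle, pos: 0, out: String::new() }
    }

    fn step(&mut self, replacement: Option<&str>) -> Step {
        if let Some(r) = replacement {
            self.out.push_str(r); // splice in the caller-provided replacement
        }
        match self.subject[self.pos..].find(self.needle) {
            Some(i) => {
                self.out.push_str(&self.subject[self.pos..self.pos + i]);
                self.pos += i + self.needle.len();
                Step::NeedsReplacement
            }
            None => {
                self.out.push_str(&self.subject[self.pos..]);
                Step::Done(std::mem::take(&mut self.out))
            }
        }
    }
}

fn main() {
    // The synchronous wrapper is just a loop; a VM would instead suspend
    // between `step` calls while a Lua function computes the replacement.
    let mut m = GsubMachine::new("a.b.c", ".");
    let mut reply: Option<&str> = None;
    loop {
        match m.step(reply.take()) {
            Step::NeedsReplacement => reply = Some("/"),
            Step::Done(s) => {
                assert_eq!(s, "a/b/c");
                break;
            }
        }
    }
}
```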
Thanks! After everything, it was also necessary to add interruptible gsub, so I pushed a change for that. At the same time I noticed a couple of places where I forgot to add `Eq` markers. I’m definitely, probably, done now, pending any feedback. I also have a complete working implementation of these methods for the piccolo VM stdlib now; if you want me to put those somewhere, or to send something upstream to supersede your PR, let me know.
Closely associated types can be stored in the same module. This reduces the amount of re-exporting that has to happen to get everything out, reduces file switching during development, and may decrease the chance of creating confusing type name conflicts.
Due to current non-conformance in the engine, this causes tests to fail.
It is not possible for the lexer to know the nesting level of sets, since the pattern `[]]` is valid: a `]` that appears immediately after the opening `[` is a literal, so `[]]` is a set matching `]`. There were bad tests in the lsonar test suite that did not conform to PUC-Lua; they have been changed for conformance.
This represents the null character and is tested in the PUC-Lua test suite.
6.4.1: > %x: (where x is any non-alphanumeric character)
For example, anchors.
6.4.1: > The beginning and the end of the subject are handled as if they were the character '\0'.
6.4.1: > A pattern item can be […] a single character class followed by '*' […]

Captures, balances, and frontiers are not character classes.
The previous design, while beautiful, was inefficient and not capable of correctly handling non-greedy matching inside a tree node. An example is the common pattern for trimming strings, `^%s*(.-)%s*$`. A non-greedy match needs to consume one token per iteration until the *entire pattern* fails to match, but this is not possible when only a subtree is visible to the repetition function. Trying to deal with this by naïve linearisation means that the edge of the capture is lost. I imagine there is some technique that would allow this to work (I am no parser expert), but the thing about Lua patterns is that they are far too simple for this level of complexity in the first place. So, this commit deletes the old engine and replaces it with a hand-translation of the PUC-Lua pattern-matching code, which works and passes the Lua test suite.
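For reference, the control flow that makes this work in PUC-Lua is `min_expand` in lstrlib.c; a hand-translated Rust sketch of it (with stand-ins for the matcher internals, not the actual code from this commit) looks like:

```rust
// Stand-ins so the sketch compiles; only the control flow matters here.
struct MatchState { /* subject bytes, pattern bytes, captures, ... */ }
fn match_rest(_ms: &mut MatchState, _s: usize, _p: usize) -> Option<usize> { None }
fn single_match(_ms: &MatchState, _s: usize, _p: usize, _ep: usize) -> bool { false }

// Non-greedy expansion: after every failed attempt to match the rest of the
// *entire pattern* (starting at `ep + 1`, past the '-'), consume exactly one
// more character with the repeated class and try again.
fn min_expand(ms: &mut MatchState, mut s: usize, p: usize, ep: usize) -> Option<usize> {
    loop {
        if let Some(end) = match_rest(ms, s, ep + 1) {
            return Some(end); // the whole remaining pattern matched
        }
        if single_match(ms, s, p, ep) {
            s += 1; // expand the non-greedy item by one character
        } else {
            return None; // cannot expand further: the item fails
        }
    }
}
```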
Hi again. This PR has totally changed, because somehow it took me way too long to see that this library as written was not actually matching according to the actual Lua language? I’m very confused about what source was used to decide what Lua patterns look like; a lot of time was wasted. For posterity, I retained the historical attempts to change the old design to actually conform, but it was impossible due to the broken non-greedy backtracking. As an academic point, I would enjoy knowing whether there is some way to make the old design not broken in this way without making it even more inefficient, but in real life this new one will work better. Since it does not create an IR, it is substantially more efficient. It could be touched up to be more Rust-y, but I’ve spent too much time on this already. Best,
Hello,
This PR contains the previous three PRs that I opened separately, plus additional patches. Changes include:
- replacing `HashMap` in the API with a simple callback function so the table type is agnostic to the consumer

**Options**

The edition and MSRV are bumped to use the stabilised if-and-while-let chain syntax. This isn’t strictly necessary, but it makes the code cleaner in areas and I don’t know why not to do this.
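As a small illustration of the syntax that motivates the bump (let chains are stable as of Rust 1.88 on edition 2024; the function here is just an example):

```rust
// With a let chain, a pattern match and a guard combine in one `if`,
// avoiding the old nested `if let` inside an `if`.
fn first_if_even(xs: &[i32]) -> Option<i32> {
    if let Some(&x) = xs.first()
        && x % 2 == 0
    {
        return Some(x);
    }
    None
}
```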
The error handling is not quite as good as it could be, but probably more than good enough given the scope of the library. In particular, there are redundant variants for unexpected tokens, because I ran out of interest in continuing to work on it. syn’s `Lookahead1` would be a good way to collect all possible expected tokens into something that can then be turned into a single error.

I would also recommend converting the internal engine tests (in `src/engine/tests.rs`) to ones that check the public APIs.

I suppose this should make the library more or less 1.0-ready, since the API is well-defined and limited, but I only bumped the pre-release number due to the many API-breaking changes.
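To sketch what that looks like with syn’s real `Lookahead1` (lsonar would implement an analogous accumulator over its own token type; `Arg` here is just an example):

```rust
use syn::parse::{Parse, ParseStream};
use syn::{Ident, LitInt, Result};

// Example parse target: either an identifier or an integer index.
enum Arg {
    Name(Ident),
    Index(LitInt),
}

impl Parse for Arg {
    fn parse(input: ParseStream) -> Result<Self> {
        // `Lookahead1` records every token type peeked, so `.error()`
        // reports all expected alternatives in one message.
        let lookahead = input.lookahead1();
        if lookahead.peek(Ident) {
            Ok(Arg::Name(input.parse()?))
        } else if lookahead.peek(LitInt) {
            Ok(Arg::Index(input.parse()?))
        } else {
            Err(lookahead.error())
        }
    }
}
```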
OK, that’s it from me for now. I will be able to answer questions or maybe make minor changes, but I don’t anticipate working on this any more from here (unless I run into some other bug).
Thanks!